Improved WOA and its application in feature selection

Feature selection (FS) can eliminate many redundant, irrelevant, and noisy features in high-dimensional data, improving the prediction, classification, and computational performance of machine learning and data mining models. We propose a feature selection method (IWOAIKFS) that combines an improved whale optimization algorithm (IWOA) with an improved k-nearest neighbors (IKNN) classifier. First, WOA is improved with chaotic elite reverse individuals, skew-distribution probability selection, nonlinear adjustment of control parameters, and a position correction strategy, which enhance the algorithm's search performance over feature subsets. Second, a sample similarity measurement criterion and a weighted voting criterion, with the weight matrix M solved by the simulated annealing algorithm, are proposed to improve the KNN classifier and thus the evaluation of feature subsets. The experimental results show that IWOA has better optimization performance on benchmark functions of different dimensions and that, when used with IKNN for feature selection, IWOAIKFS achieves better classification performance and robustness.


The introduction paragraph should be presented more extensively.
Answering the question: To explain the feasibility of this study more broadly, two aspects have been added.
Answering the question: In response to this problem, we have added a Conclusions section to the paper. There are two additions:
① The innovations, advantages, and disadvantages of the method proposed in this paper.
② Planned future research, which consists of three parts: building a theoretical analysis system, an evaluation system, and a community communication module for meta-heuristic algorithms; reducing the time complexity of IWOAIKFS; and building a large data set preprocessing system based on IWOAIKFS.
For details, please refer to the Conclusions section of the revised manuscript.
6. It would be good to provide the pros and cons of the newly proposed method.
Answering the question: In this paper, we propose three improved algorithms: IWOA, IKNN, and IWOAIKFS. In response to your comment, the revised manuscript gives the advantages and disadvantages of the three improved algorithms, as follows:
Advantage of IWOA: It enhances the algorithm's exploration ability and fast convergence in the solution space, and has better convergence and search performance on most unimodal and multimodal functions.
Disadvantage of IWOA: It is not suitable for all function optimization problems. For multimodal functions, the convergence speed is slow and the time complexity is slightly higher.
Advantage of IKNN: It distinguishes the similarity between samples more effectively, and has better classification performance and robustness.

3: Use the Friedman statistical test to better evaluate the proposed method.
Answering the question: To better evaluate the effectiveness of the proposed method, we have added the Wilcoxon test and the Friedman test for the IWOA algorithm; see Section 5.2.3 of the revised manuscript for details. We have also added the Friedman test for the IWOAFS algorithm; see Section 5.4.4 of the revised manuscript for details. The test results are as follows; for a detailed explanation, see the sections of the revised manuscript cited above.
① Wilcoxon test of IWOA

Answering the question: To make the proposed algorithm suitable for the feature selection problem, this paper maps the continuous search space to a binary space. The mapping takes the value 1 when the algorithm fitness is greater than 0.5 and the value 0 when it is less than or equal to 0.5.
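The thresholding rule can be illustrated with a minimal sketch. It assumes the rule is applied to each component of a whale's continuous position vector and that the same threshold of 0.5 separates both cases; the function names `binarize_position` and `selected_features` are hypothetical and are not taken from the manuscript.

```python
from typing import List, Sequence


def binarize_position(position: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Map a continuous position vector to a 0/1 feature mask.

    Assumption: a component strictly greater than the threshold selects
    the feature (1); anything else drops it (0).
    """
    return [1 if value > threshold else 0 for value in position]


def selected_features(mask: Sequence[int]) -> List[int]:
    """Return the indices of the features kept by a binary mask."""
    return [index for index, bit in enumerate(mask) if bit == 1]


# Illustrative values only; they are not results from the manuscript.
position = [0.91, 0.50, 0.12, 0.73, 0.49]
mask = binarize_position(position)
print(mask)                     # [1, 0, 0, 1, 0]
print(selected_features(mask))  # [0, 3]
```

Note that a component equal to 0.5 maps to 0 under this reading, so the boundary value is excluded from the selected subset.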

(3) Time complexity analysis of IWOAIKFS (Section 4.3 of the revised manuscript)
Since IWOAIKFS is an FS method obtained by using IWOA to optimize IKNN, its time complexity can be divided

6: What is the value of K in the KNN algorithm? Please explain the main reason for the K value used.
Answering the question: In this paper, the value of K in the KNN algorithm is 5. There are two main reasons for the K value to be 5:

7: Please add future work to the conclusion section and discuss it briefly.
Answering the question: Future work has been added to the conclusion section.
Although the three improved methods proposed in this paper perform better than the original algorithms, they still have some shortcomings. For example, IWOA has poor convergence performance on high-dimensional multimodal functions, and the time complexity of IKNN and IWOAIKFS is too high. We will therefore conduct further research on these issues, as follows.
(1) We plan to build a theoretical analysis system and an evaluation system for meta-heuristic algorithms, as well as a community communication module. Because "metaphors" are overused in meta-heuristic algorithms, it is difficult to tell whether a new (or improved) meta-heuristic algorithm actually advances research in the field of optimization. Follow-up research will therefore try to establish a theoretical analysis system, an evaluation system, and a community communication module for meta-heuristic algorithms.
(2) We plan to reduce the time complexity of IWOAIKFS. Because IWOAIKFS is a fusion of IWOA and the IKNN algorithm and is influenced by the IKNN algorithm, its time complexity is much higher than that of common feature selection methods. Follow-up research will therefore try to integrate the training and testing processes in IKNN to reduce the time complexity of the IKNN algorithm, thereby reducing the time complexity of IWOAIKFS.
(3) We plan to build a large data set preprocessing system based on IWOAIKFS. Once the evaluation framework for meta-heuristic algorithms has been built and the time complexity of IWOAIKFS has been reduced, we can try to build such a system to process complex data sets quickly so that they can be passed to machine learning sooner.

8: What is the main reason for choosing the Gauss/mouse chaotic map? Please elaborate.
Answering the question: The more evenly the initial population is distributed in the solution space, the greater the probability that the algorithm finds the optimal value. Compared with a random search strategy, chaotic search is widely used to generate initial populations because of its randomness, ergodicity, non-repetition, and other characteristics. However, different chaotic maps affect the initial population of the algorithm differently. Therefore, this paper analyzes and compares random initialization (Rand), the Gauss map, the Tent map, and the Chebyshev map to select the chaotic map best suited to the whale optimization algorithm. The original initial population and the initial populations generated by the three chaotic maps are shown in Fig. 1. Figures 1(a), 1(b), 1(c), and 1(d) show, respectively, the original initial whale population and the initial whale populations generated by the Tent map, the Chebyshev map, and the Gauss map. As Fig. 1 shows, the whale population generated by the Gauss map is distributed more evenly in space, which provides a better guarantee for the global optimization of the algorithm.
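As an illustration of chaotic initialization, the following minimal sketch generates an initial population with the Gauss/mouse map in its commonly stated form, x_{k+1} = 0 if x_k = 0 and x_{k+1} = (1 / x_k) mod 1 otherwise. How the chaotic sequence is assigned to individuals, the random starting values, and the names `gauss_map_sequence` and `chaotic_population` are assumptions made for this example; they are not the exact procedure of the manuscript.

```python
import random
from typing import List


def gauss_map_sequence(length: int, start: float) -> List[float]:
    """Iterate the Gauss/mouse map `length` times, starting from `start`."""
    values = []
    x = start
    for _ in range(length):
        # Gauss/mouse map: 0 stays 0, otherwise keep the fractional part of 1/x.
        x = 0.0 if x == 0.0 else (1.0 / x) % 1.0
        values.append(x)
    return values


def chaotic_population(pop_size: int, dim: int, lower: float, upper: float,
                       seed: int = 0) -> List[List[float]]:
    """Build a population whose coordinates come from Gauss-map sequences.

    Assumption: each individual gets its own chaotic sequence of length
    `dim`, started from a random value in (0, 1) and scaled to [lower, upper].
    """
    rng = random.Random(seed)
    population = []
    for _ in range(pop_size):
        start = rng.uniform(0.01, 0.99)
        chaos = gauss_map_sequence(dim, start)
        population.append([lower + c * (upper - lower) for c in chaos])
    return population


population = chaotic_population(pop_size=30, dim=2, lower=-100.0, upper=100.0)
print(len(population), len(population[0]))
```

Plotting such a two-dimensional population as a scatter plot gives a picture comparable in kind to the panels of Fig. 1, although the exact distribution depends on the starting values assumed here.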

Answering the question:
In feature selection, the commonly used evaluation criteria include the number of selected features, the average classification accuracy of the feature subset, the standard deviation, and the convergence curve. These criteria are already given in the manuscript, so no additional criteria are used for evaluation. However, to better demonstrate the effectiveness of the proposed method, we have drawn boxplots of the classification accuracy of all optimizers on the 15 datasets over 30 independent experiments. In Fig. 12, the lower quartile (Q1) represents lower values, the upper quartile (Q3) represents higher values, and the red line in the box represents the median. Fig. 12 shows that IWOAIKFS ranks first in performance among all algorithms and has the best performance on the 15 datasets.
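The quantities that a boxplot displays can be computed as in the following minimal sketch. The accuracy values are invented for illustration and are not results from the manuscript; the helper name `box_summary` and the use of the "inclusive" quantile method are likewise assumptions.

```python
import statistics
from typing import Dict, Sequence


def box_summary(accuracies: Sequence[float]) -> Dict[str, float]:
    """Return the quartiles that define the box of a boxplot.

    Assumption: quartiles use linear interpolation over the observed
    range (the "inclusive" method of statistics.quantiles).
    """
    q1, median, q3 = statistics.quantiles(accuracies, n=4, method="inclusive")
    return {"Q1": q1, "median": median, "Q3": q3, "IQR": q3 - q1}


# Hypothetical accuracies from repeated runs of one optimizer on one dataset.
runs = [0.90, 0.92, 0.94, 0.96, 0.98]
summary = box_summary(runs)
print(summary["Q1"], summary["median"], summary["Q3"])
```

A higher median together with a narrower interquartile range (IQR) across the 30 runs indicates an optimizer that is both more accurate and more stable.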

Finally, we are very grateful for your comments on the manuscript. Thank you very much for taking time out of your busy schedule to read our manuscript and for giving us many valuable suggestions that have enriched it. Best wishes to you!